Team: Preeti Swaminathan, Patrick McDevitt, Andrew Abbott, Vivek Bejugama
10-dec-2017
Github link: https://github.com/bici-sancta/mashable/tree/master/lab_03
4.0 Modeling and Evaluation 1 : Train and adjust parameters
5.0 Modeling and Evaluation 2 : Evaluate and Compare
6.0 Modeling and Evaluation 3 : Visualize Results
7.0 Modeling and Evaluation 4 : Summarize the Ramifications
The code sequence to execute all of the analysis and images provided in this section of the report is as follows:
Any of the clustering routines (03, 04, 05) can be run independently after the 01 data prep and 02 t-SNE dimensionality reduction routines.
Professor Jake,
We have organized this project differently from other submissions.
Reasons
The code base for Modeling and Evaluation 1 through 4 has been moved to separate files. The plots you see in this file are images saved from that code and inserted here. We had to do this because clustering with t-SNE kept changing the groupings with each execution. Our code base is ordered as described in the structure above and has been fully executed.
We also had to split the notebook into multiple files because it was becoming too large.
The table of contents above is hyperlinked. You can click on a main heading to go directly to that section, and from any section you can return to the table of contents by clicking the "Return to Table of Contents" link provided under each section.
Thanks,
Team
from IPython.display import Image
We are using the Online News Popularity dataset from the UCI Machine Learning Repository. The dataset is a collection of 61 heterogeneous features for approximately 40,000 articles published by Mashable (www.mashable.com). The features are not the articles themselves, but attributes extracted from each article, such as word counts, title word counts, and keyword associations. The data covers a two-year period of published articles ending in January 2015.
We intend to perform cluster analysis on this data set.
The question from the business is to identify characteristic groupings among the published articles in order to provide insight to the product owners. In this case the product owners are the editorial owners of each data channel; each owner has editorial control over what is published within their channel. The product owners have requested a cluster analysis to develop a better understanding of the characteristics over which they have control (e.g., length of articles, visual vs. textual content volume, positive and negative sentiment, and cross-channel composition). The objective here is to provide this baseline cluster analysis; the content owners then intend to conduct A/B testing within these characteristics in future months to understand whether changes can attract additional or different readership. To support this objective, the specific task at hand is to provide baseline cluster definitions and to develop the method for deploying the cluster analysis so that it can be re-executed on an approximately monthly or bi-monthly basis. The repetitive analyses will help determine whether the content owners are in fact successful in managing intended change (by comparative cluster analysis against the current baseline) and whether those changes promote or detract from additional readership.
The developers of the data set request the citation below for any use of the data.
The data is located at https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
Citation Request :
K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.
For the clustering analyses, we will evaluate different clustering methods and, for each method, generate results using a wide range of appropriate values for the parameters that control its outputs. Each clustering method has different parameters and, in some cases, different options for performance evaluation. In all cases, we will be able to assess the silhouette score and the process execution time for each modeling technique. For the range of parameters evaluated, we will select the model that provides the highest silhouette score. The evaluation of execution time will be less quantitative at this time; we will report any extraordinarily long process times that would indicate a potential problem in deployment.
Subsequent to the clustering analyses, a comparative analysis across clustering methods will be provided. An element of success in this regard will be relative consistency across clustering methods: if different methods provide similar views of the cluster characteristics, we will take that as an indication of robustness and validity for the clusters identified. This comparative analysis will be somewhat qualitative in nature, but for the purposes of this first evaluation, and for the nature of this specific data set, we submit that it provides a sufficient basis to identify whether there is merit in further developing this cluster analysis concept. If unique characteristics are identifiable across the clusters, and if these characteristics are ostensibly controllable by the product owners for future modification, then the clustering is considered sufficiently successful for the current purposes.
The first part of the validation involves selecting the controlling parameters for each clustering method. This will be done using generally accepted standards, i.e., maximizing the silhouette score, which establishes that the selected parameters for each cluster method produce outputs with relatively high cohesion/separation ratios for the defined clusters.
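As an illustration of this selection rule, the sketch below scores a range of k values for k-means and keeps the one with the highest mean silhouette score; the synthetic blobs stand in for the prepared Mashable feature matrix.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic stand-in for the scaled feature matrix
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # parameter value with the highest silhouette
```

The same loop structure applies to the other clustering methods, with their own controlling parameters swept in place of k.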
The additional element of validation is related to identifying actionable cluster definitions. The business objectives at this time are to (a) develop some baseline understanding, and (b) identify levers for content control in future months. Consistent with these objectives, cluster definitions that identify modifiable characteristics within each cluster satisfy the current objective.
Attribute Information:
0. url: URL of the article
1. timedelta: Days between the article publication and the dataset acquisition
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
7. num_hrefs: Number of links
8. num_self_hrefs: Number of links to other articles published by Mashable
9. num_imgs: Number of images
10. num_videos: Number of videos
11. average_token_length: Average length of the words in the content
12. num_keywords: Number of keywords in the metadata
13. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
14. data_channel_is_entertainment: Is data channel 'Entertainment'?
15. data_channel_is_bus: Is data channel 'Business'?
16. data_channel_is_socmed: Is data channel 'Social Media'?
17. data_channel_is_tech: Is data channel 'Tech'?
18. data_channel_is_world: Is data channel 'World'?
19. kw_min_min: Worst keyword (min. shares)
20. kw_max_min: Worst keyword (max. shares)
21. kw_avg_min: Worst keyword (avg. shares)
22. kw_min_max: Best keyword (min. shares)
23. kw_max_max: Best keyword (max. shares)
24. kw_avg_max: Best keyword (avg. shares)
25. kw_min_avg: Avg. keyword (min. shares)
26. kw_max_avg: Avg. keyword (max. shares)
27. kw_avg_avg: Avg. keyword (avg. shares)
28. self_reference_min_shares: Min. shares of referenced articles in Mashable
29. self_reference_max_shares: Max. shares of referenced articles in Mashable
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
31. weekday_is_monday: Was the article published on a Monday?
32. weekday_is_tuesday: Was the article published on a Tuesday?
33. weekday_is_wednesday: Was the article published on a Wednesday?
34. weekday_is_thursday: Was the article published on a Thursday?
35. weekday_is_friday: Was the article published on a Friday?
36. weekday_is_saturday: Was the article published on a Saturday?
37. weekday_is_sunday: Was the article published on a Sunday?
38. is_weekend: Was the article published on the weekend?
39. LDA_00: Closeness to LDA topic 0
40. LDA_01: Closeness to LDA topic 1
41. LDA_02: Closeness to LDA topic 2
42. LDA_03: Closeness to LDA topic 3
43. LDA_04: Closeness to LDA topic 4
44. global_subjectivity: Text subjectivity
45. global_sentiment_polarity: Text sentiment polarity
46. global_rate_positive_words: Rate of positive words in the content
47. global_rate_negative_words: Rate of negative words in the content
48. rate_positive_words: Rate of positive words among non-neutral tokens
49. rate_negative_words: Rate of negative words among non-neutral tokens
50. avg_positive_polarity: Avg. polarity of positive words
51. min_positive_polarity: Min. polarity of positive words
52. max_positive_polarity: Max. polarity of positive words
53. avg_negative_polarity: Avg. polarity of negative words
54. min_negative_polarity: Min. polarity of negative words
55. max_negative_polarity: Max. polarity of negative words
56. title_subjectivity: Title subjectivity
57. title_sentiment_polarity: Title polarity
58. abs_title_subjectivity: Absolute subjectivity level
59. abs_title_sentiment_polarity: Absolute polarity level
60. shares: Number of shares (target)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.simplefilter('ignore',DeprecationWarning)
import seaborn as sns
import time
#Import Data from .csv file
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# ... change directory as needed to point to local data file
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
df = pd.read_csv('../data/OnlineNewsPopularity.csv')
# Strip leading spaces and store all the column names to a list
df.columns = df.columns.str.strip()
#col_names = df.columns.values.tolist()
Step 1: Using the standard read_csv function, all the variables are imported as data type float64. Many of the values are more logically integer (counts) or boolean (e.g., is_weekend). We will convert those fields to data types appropriate to their nature.
Step 2: Using the 'duplicated' function in python, we confirmed that there are no duplicated data in this data set.
Step 3: Using standard python functions, we can also see that there are no missing values for any of the data cells either.
Note: So, those two traditional questions in the data cleaning phase do not need any specific action for this data set.
Step 4: There are, however, outliers in some of the data columns. Based on observations from scatter plots, histograms, and the evaluation of descriptive statistics (especially skewness), we chose to transform all of the variables with high right-skewness (i.e., skewness > 1) using a log transformation. Since the purpose of this evaluation is cluster analysis, log-transforming right-skewed data may improve the distance characteristics in a way that supports the clustering.
Step 5: Subsequent to performing the log transform, we can further evaluate for outliers in the transformed data space.
Step 1: convert to appropriate data type
# Converting the data type to Integer
to_int = ['timedelta','n_tokens_title', 'n_tokens_content','num_keywords',
'num_hrefs','num_self_hrefs', 'num_imgs', 'num_videos','shares' ]
df[to_int] = df[to_int ].astype(np.int64)
Step 2: Identify duplicates
# Check for duplicates
df[df.duplicated()]
From above, we confirmed that there are no duplicated data in this data set.
Step 3: Identify missing values
df.describe().T
From above, we can also see that there are no missing values for any of the data cells either.
Step 4 & 5 :: Identify Outliers & transform data
df["shares"].describe()
Quartiles
1: 946
2: 1400
3: 2800
This is consistent with the stated business objective to characterize
< 946 as Regular
between 946 to 1400 as Good
between 1400 and 2800 as Popular
anything higher as Viral
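A minimal sketch of how those breakpoints could label articles, assuming the quartile values above and a few illustrative share counts (`pd.cut` uses right-inclusive bins here):

```python
import pandas as pd

# illustrative share counts; bin edges are the quartiles reported above
shares = pd.Series([500, 1000, 2000, 50000])
tiers = pd.cut(shares,
               bins=[0, 946, 1400, 2800, float('inf')],
               labels=['Regular', 'Good', 'Popular', 'Viral'])
```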
There are 60 columns in the original data set; we added a few additional columns based on observed opportunities (e.g., publication date) as explained above.
From this data set, we did a simple correlation matrix to look for variables that are highly correlated with each other that could be removed with little loss of information.
With that downselection, we proceeded with additional evaluation of these remaining variables.
We recognize that there is likely significant additional opportunity for modeling improvements with many of the remaining variables, and we will look to re-expand the data set to further consider that in future work.
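The manual downselection described above can also be assisted programmatically; the sketch below lists feature pairs whose absolute correlation exceeds a threshold, using a tiny stand-in frame (`df_demo`) in place of the cleaned Mashable data.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3, 4],
                        'b': [2, 4, 6, 8],   # perfectly correlated with 'a'
                        'c': [4, 1, 3, 2]})

corr = df_demo.corr().abs()
# keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(r, c) for c in upper.columns for r in upper.index
         if upper.loc[r, c] > 0.9]   # NaN comparisons evaluate False
```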
sns.set(style="white")
# Compute the correlation matrix
corr = df.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated; use the builtin bool
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 13))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
# from example found at https://www.kaggle.com/maheshdadhich/strength-of-visualization-python-visuals-tutorial/notebook
Using the correlation matrix above along with judgement we removed the following variables:
# Features to drop, selected from the correlation matrix and judgment
dropped_features = ['url','timedelta','num_self_hrefs','n_unique_tokens','average_token_length','kw_min_min','LDA_03','LDA_04',
'global_subjectivity','min_positive_polarity','max_positive_polarity',
'min_negative_polarity','max_negative_polarity','global_sentiment_polarity',
'n_non_stop_words','n_non_stop_unique_tokens','kw_max_min','kw_avg_min',
'kw_min_max','kw_max_max','kw_avg_max','kw_min_avg','kw_max_avg',
'rate_negative_words','avg_positive_polarity','self_reference_min_shares',
'weekday_is_saturday','weekday_is_sunday','self_reference_max_shares','title_subjectivity',
'shares','rate_positive_words','abs_title_sentiment_polarity']
df1 = df.drop(dropped_features, axis = 1)
# Compute the correlation matrix
corr = df1.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated; use the builtin bool
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 13))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
imp_features = ['n_tokens_title',
'n_tokens_content',
'num_hrefs',
'num_imgs',
'num_videos',
'num_keywords',
'kw_avg_avg',
'self_reference_avg_sharess',
'LDA_00',
'LDA_01',
'LDA_02',
'global_rate_positive_words',
'global_rate_negative_words',
'avg_negative_polarity',
'title_sentiment_polarity',
'abs_title_subjectivity']
for var in imp_features:
df1.boxplot(column = var)
plt.show()
This section contains histograms and cross tabs of several important variables and the popularity rates for each group of variables. The histograms show the distributions of several important variables.
After looking at the boxplots of these variables, it is evident that many are heavily skewed. Before they are used in the analysis, a log transformation would be beneficial. The following bit of code makes those transformations, creating new variables.
# ---------------------------------
# Log transform variables with high skewness
# ---------------------------------
log_features = ['n_tokens_content',
'num_hrefs',
'num_imgs',
'num_videos',
'kw_avg_avg',
'self_reference_avg_sharess',]
df1 = df  # note: this re-points df1 at the full data frame (the dropped columns are available again)
# store min value for each column
df_mins = df1[log_features].min()
for column in log_features:
sk = df1[column].skew()
if(sk > 1):
new_col_name = 'ln_' + column
print (column, sk, new_col_name)
if df_mins[column] > 0:
df1[new_col_name] = np.log(df1[column])
elif df_mins[column] == 0:
df_tmp = df1[column] + 1
df1[new_col_name] = np.log(df_tmp)
else:
print('--> Log transform not completed :', column, '!!')
plt.hist(df1['n_tokens_title'], bins = 20)
plt.xlabel('Number of title words')
plt.ylabel('Frequency')
plt.show()
First is the distribution of the number of words contained in the title. This variable is approximately normally distributed and does not require transformation.
plt.hist(df1['n_tokens_content'], bins = 20)
plt.xlabel('Number of words')
plt.ylabel('Frequency')
plt.show()
plt.hist(df1['ln_n_tokens_content'], bins = 20)
plt.xlabel('Log number of words')
plt.ylabel('Frequency')
plt.show()
The distribution of the untransformed data is heavily skewed; the transformed distribution shows the improvement.
df_channel = df1
df_channel['data_channel'] = np.nan  # np.NaN was removed in NumPy 2.0
condition = df['data_channel_is_lifestyle'] == 1
df_channel.loc[condition, 'data_channel'] = 'Lifestyle'
condition = df['data_channel_is_entertainment'] == 1
df_channel.loc[condition, 'data_channel'] = 'Entertainment'
condition = df['data_channel_is_bus'] == 1
df_channel.loc[condition, 'data_channel'] = 'Business'
condition = df['data_channel_is_socmed'] == 1
df_channel.loc[condition, 'data_channel'] = 'SocMed'
condition = df['data_channel_is_tech'] == 1
df_channel.loc[condition, 'data_channel'] = 'Tech'
condition = df['data_channel_is_world'] == 1
df_channel.loc[condition, 'data_channel'] = 'World'
df_channel = df_channel.groupby(by=['data_channel'])
channel_count = df_channel['data_channel'].count()
_=channel_count.plot(kind='barh', stacked=True, color = ['blue'])
plt.hist(df1['global_rate_positive_words'], bins = 20)
plt.xlabel('Global rate of positive words')
plt.ylabel('Frequency')
plt.show()
As a first exploratory for the relationships among the available variables, we choose to select 9 variables from 4 of the above categories that appear, on the surface, to be highly relevant to the business case objective.
In this data set, n_tokens_content refers to the number of words in the published article. Because word counts, image counts, and video counts are likely independent predictors of the eventual number of shares, we evaluated scatter plots of (1) number of tokens vs. number of images, (2) number of tokens vs. number of videos, and (3) number of images vs. number of videos to understand their respective utility in a predictive model.
These are shown in below plots.
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# ... Tokens, Images, Videos plots
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import numpy as np
# add 1 before taking logs, since image and video counts can be zero;
# bracket assignment creates real columns (attribute assignment would not)
df1['log_n_tokens'] = np.log(df1['n_tokens_content'] + 1)
df1['log_n_imgs'] = np.log(df1['num_imgs'] + 1)
df1['log_n_videos'] = np.log(df1['num_videos'] + 1)
plt.plot(df1.log_n_tokens, df1.log_n_imgs, label = 'Tokens-content - NumImages', linestyle = 'None', marker = 'o')
plt.xlabel('Tokens-content')
plt.ylabel('Images')
plt.title('mashable characteristics')
plt.legend()
plt.show()
plt.plot(df1.log_n_tokens, df1.log_n_videos, label = 'Tokens-content - NumVideos', linestyle = 'None', marker = 'o')
plt.xlabel('Tokens-content')
plt.ylabel('Videos')
plt.title('mashable characteristics')
plt.legend()
plt.show()
plt.plot(df1.log_n_imgs, df1.log_n_videos, label = 'NumImages - NumVideos', linestyle = 'None', marker = 'o')
plt.xlabel('Images')
plt.ylabel('Videos')
plt.title('mashable characteristics')
plt.legend()
plt.show()
We can make the following observations :
_(ref : https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)_
LDA (in this context) refers to a method by which a body of text can be scored relative to a vocabulary set that is identified with specific topics. A body of text that discusses Machine Learning, for instance, will use vocabulary specific to that topic, and quite different from another text which discusses do-it-yourself home repair. A text can be scored relative to the similarity of a given LDA scale and then compared among other texts for similarity or difference.
This data set includes measures for 5 LDA topics, identified here as LDA_00, LDA_01, ..., LDA_04.
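The LDA scores in this data set were precomputed by its authors, but a sketch of how such per-document topic-closeness scores are produced (with a toy corpus and an illustrative topic count) looks like:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["machine learning model training data",
        "home repair paint wall tools",
        "learning data model features training"]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_scores = lda.fit_transform(X)   # each row sums to 1: closeness to each topic
```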
Similar to the above visualization, we choose to review the relative visual correlation among the LDA scores of these articles via scatter plots, this time with each LDA score plotted against LDA_00 (vs. LDA_01, LDA_02, etc.), along with a basic histogram of each individual distribution. These are shown in the plots below.
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# ... LDA plots
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
num_bins = 20
plt.hist(df1.LDA_00, num_bins, facecolor='blue', alpha=0.5)
plt.title('LDA 00')
plt.show()
num_bins = 20
plt.hist(df1.LDA_01, num_bins, facecolor='slateblue', alpha=0.5)
plt.title('LDA 01')
plt.show()
num_bins = 20
plt.hist(df1.LDA_02, num_bins, facecolor='mediumorchid', alpha=0.5)
plt.title('LDA 02')
plt.show()
plt.plot(df1.LDA_00, df1.LDA_01, label = 'LDA_00 - LDA_01', linestyle = 'None', marker = 'o')
plt.xlabel('LDA_00')
plt.ylabel('LDA_01')
plt.title('mashable characteristics')
plt.legend()
plt.show()
plt.plot(df1.LDA_00, df1.LDA_02, label = 'LDA_00 - LDA_02', linestyle = 'None', marker = 'o')
plt.xlabel('LDA_00')
plt.ylabel('LDA_02')
plt.title('mashable characteristics')
plt.legend()
plt.show()
We can make the following observations :
The data set contains an interesting feature idea for this evaluation. The links embedded in each article, prior to publication, are evaluated for their keywords and the number of social media shares associated with those embedded links. These are then scored relative to the success rates of each of the keywords on a best/avg/worst basis. This is an interesting concept: in essence, using the social media share success or failure of previously related keywords as an estimator for a to-be-published article. Since this is an interesting, and potentially useful, feature in this data set, we decided to explore a simple data view to visualize the relationship between kw_avg_max and self_reference_avg_sharess, to verify that the distributions are understood and that there is no dependency between these features.
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# ... Keywords visualizations
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import numpy as np
num_bins = 20
plt.hist(df1.kw_avg_max, num_bins, facecolor='mediumorchid', alpha=0.5)
plt.title('kw_avg_max')
plt.show()
df1['log_self_share'] = np.log(df1['self_reference_avg_sharess'] + 1)
axes = plt.gca()
axes.set_xlim([4,14])
num_bins = 20
plt.hist(df1.log_self_share, num_bins, facecolor='slateblue', alpha=0.5)
plt.title('ln_self_reference_avg_shares')
plt.show()
axes = plt.gca()
axes.set_ylim([4,14])
plt.plot(df1.kw_avg_max, df1.log_self_share, label = 'kw_avg_max - self_ref_avg', linestyle = 'None', marker = 'o')
plt.xlabel('kw_avg_max')
plt.ylabel('ln_self_reference_avg_sharess')
plt.title('mashable characteristics')
plt.legend()
plt.show()
From above plots, we observe the following :
In this section we discuss our exceptional work: t-SNE. We will follow the steps below.
- Non-Hierarchical: K-means, Spectral
- Hierarchical: Linkage type Ward, Average, Complete.
The base data set from which we are starting has approximately 35 features. The data set was cleaned and pre-processed for analysis, as outlined in the prior sections of this report: missing values identified, outliers dispositioned, and all features re-scaled to a standard normal distribution.
In the nominal data set, some features are naturally scaled from 0 to 1 (real values), such as the Latent Dirichlet Allocation (LDA) measures, while other features are measured in the range of 0 to 800,000 (e.g., number of shares in a social media context). Since both dimensionality reduction and cluster analyses depend on relative magnitudes, all features were mapped to a standard normal distribution to provide even weighting of all features in the mapping/clustering processes. The binary features (e.g., data_channel_is_tech) are all retained as binary 0/1 valued features and one-hot encoded to similarly support evenly distributed distance evaluations among such categorical features.
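A sketch of this scaling step, with illustrative column names: continuous features go through StandardScaler while the 0/1 indicators are left untouched.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_demo = pd.DataFrame({'shares': [100, 2000, 800000, 1500],
                        'LDA_00': [0.1, 0.9, 0.4, 0.6],
                        'is_weekend': [0, 1, 0, 1]})

continuous = ['shares', 'LDA_00']
df_scaled = df_demo.copy()
# map continuous features to mean 0 / std 1; keep binary features as 0/1
df_scaled[continuous] = StandardScaler().fit_transform(df_demo[continuous])
```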
Early efforts in which we attempted to use the cleaned data set and perform cluster analyses yielded results which did not provide straightforward interpretations of the clustering results. Visually, the cluster maps did not provide well-organized presentations of clusters and the silhouette and distortion metrics were generally disorganized as a function of the number of clusters - these metrics were not smooth functions that indicated in any clear sense an optimal or even preferred number of clusters from those analyses. Methods attempted at that point included k-means, DBSCAN, and Spectral Clustering.
Thus, we were motivated to explore dimensionality reduction as a means to simplify the data set that we presented to the clustering algorithms. Evaluating choices for dimensionality reduction we considered Principal Components Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). Between the two methods, we decided to evaluate t-SNE.
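A minimal sketch of the t-SNE reduction step; random data stands in for the ~35 scaled features, and the perplexity here is illustrative (the saved plots above were generated with other values).

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 35))        # stand-in for the ~35 prepared features

# map to 2-D; the clustering methods are then run on these coordinates
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```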
Image("./plots/tsne/t_sne_divergence_process_time.png")
Image("./plots/tsne/t_sne_clusters_all_together.png")
Having completed the t-SNE mapping, the next step in the process was to apply different clustering methods for evaluation of appropriate clustering results.
In our evaluation, we chose the clustering approaches listed in the steps above: k-means, spectral, and hierarchical (Ward, average, and complete linkage).
These three methods have fundamental differences and we assessed that they can provide different opportunities to identify different resulting cluster definitions.
Image("./plots/kmeans/kmeans_7_cluster_map.png")
Image("./plots/spectral/spectral_8_cluster_map.png")
In this section we will build 3 hierarchical models based on the linkage methods (Ward, Complete, Average). We will compare the behavior of the 3 models and identify any patterns within.
To identify the number of clusters needed, we will use dendrograms.
Once the models are run, we will look at the summary table of silhouette values and processing times.
Steps: The process for implementing the hierarchical clustering was as follows :
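The process just described can be sketched with SciPy's hierarchy tools: build the linkage matrix on the t-SNE 2-D coordinates, inspect the dendrogram (via `dendrogram(Z)`), and cut the tree at a chosen cluster count. Random 2-D points stand in for the t-SNE output.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X_2d = rng.normal(size=(100, 2))      # stand-in for the t-SNE coordinates

Z = linkage(X_2d, method='ward')      # also run with 'average' and 'complete'
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into <= 3 clusters
```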
In this section we will evaluate and compare each clustering algorithm, as follows:
The inertia score provides the sum of squared distances of each value to its assigned cluster centroid. A lower inertia score indicates lower variance within the set of clusters. As inertia decreases continually with the number of clusters, typical practice is to identify an 'elbow' in the inertia vs. number-of-clusters plot as an indicator of the optimal number of clusters. That method is used here to identify that approximately 7 clusters (or slightly more) is the appropriate range.
Thus, by standard measures, an appropriate choice for the number of clusters from this k-means clustering analysis is 7.
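A sketch of the elbow computation described above, using synthetic blobs in place of the t-SNE coordinates:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=7, random_state=0)

# inertia decreases as k grows; the 'elbow' marks diminishing returns
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 12)]
```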
The 7 cluster set will be generated, visualizations provided, and results analysed in subsequent sections of this report.
The 7 cluster map in the t-SNE space is the figure shown above in Section 4.1.2.
Image("./plots/kmeans/cluster_kmeans_number_of_clusters_eval.png")
For this evaluation, we plotted the average silhouette score for each k value over the range of clusters. From the plot below, the silhouette score indicates that a clustering of 7, 8, 9, or 10 provides the highest silhouette scores; the local maximum is observed at 8 clusters.
Thus, by silhouette score, appropriate choices for the number of clusters from this spectral clustering analysis are 7 to 10; we will evaluate the analysis with 8 clusters, as that provides the local maximum in silhouette score.
The 8 cluster map in the t-SNE space is the figure shown above in Section 4.1.2.
Image("./plots/spectral/cluster_spctrl_number_of_clusters_eval.png")
In this section we will build 3 hierarchical models based on the linkage methods (Ward, Complete, Average). We will compare the behavior of the 3 models and identify any patterns within.
To identify the number of clusters needed, we will use dendrograms.
Once the models are run, we will look at the summary table of silhouette values and processing times.
Steps: The process for implementing the hierarchical clustering was as follows :
Evaluation
The figures below present the dendrogram and the corresponding cluster-separation visualization, as displayed in the t-SNE 2-D space, for each of the above 3 linkages.
Image("./plots/hierarchical/dendrogram_average.png")
Image("./plots/hierarchical/hc_average_eval.png")
Image("./plots/hierarchical/dendrogram_ward.png")
Image("./plots/hierarchical/hc_ward_eval.png")
Image("./plots/hierarchical/dendrogram_complete.png")
Image("./plots/hierarchical/hc_complete_eval.png")
Looking at the dendrograms, we arrive at the following optimal cluster counts.
We will run models for these values and create an evaluation matrix.
Image("./plots/hierarchical/hc_evaluation_matrix.png")
The hierarchical cluster matrix above shows average to below-average silhouette scores for all 3 models. Although Ward shows a score of 41%, it is not a good model because it could not identify more than 2 clusters. This could be a consequence of the t-SNE approach, which may have lost distance and density information, causing the data to be too closely clustered in the t-SNE 2-D mapped space.
Processing time was good for all 3 models. Such good processing time for a hierarchical model is likely due to the dimension reduction, since we reduced all features to just 2.
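A sketch of how an evaluation matrix like this can be assembled (silhouette score and wall-clock time per linkage); the 2-D input and cluster count are illustrative stand-ins for the t-SNE coordinates and the dendrogram-selected counts.

```python
import time
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X_2d = rng.normal(size=(200, 2))      # stand-in for the t-SNE coordinates

rows = []
for link in ['ward', 'complete', 'average']:
    t0 = time.time()
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X_2d)
    rows.append({'linkage': link,
                 'silhouette': silhouette_score(X_2d, labels),
                 'seconds': time.time() - t0})

eval_matrix = pd.DataFrame(rows)
```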
We will do an in-depth analysis of each feature for each linkage type further.
6.0 Modeling and Evaluation 3 : Visualize Results
In this section, we will visualize each model independently in the t-SNE 2-D space, displaying the magnitudes and distributions of the original features relative to the cluster labels. We will produce 3 plots for each feature to assist with that visualization.
At the end, we will plot a feature-importance summary of how each feature is distributed within each cluster for each model.
We will visualize this section for:
6.1 K-means
6.2 Spectral
6.3 Hierarchical
To evaluate the resulting k = 7 clusters, the following approach is taken:
construct a visual interpretation aid of a 3-plot set for each feature, as shown in the figures below. Each 3-plot set includes the following:
These plots show the following relationships :
Similarly, observations about the relative participation of each feature in each of the clusters were identified and used in the subsequent interpretations of the clusters.
Image("./plots/kmeans/cluster_kmeans_3way_preplx_100__7clstrsln_LDA_00.png")
Image("./plots/kmeans/cluster_kmeans_3way_preplx_100__7clstrsln_LDA_01.png")
Image("./plots/kmeans/cluster_kmeans_3way_preplx_100__7clstrsln_LDA_02.png")
Image("./plots/kmeans/cluster_kmeans_3way_preplx_100__7clstrsln_LDA_03.png")
Image("./plots/kmeans/cluster_kmeans_3way_preplx_100__7clstrsln_LDA_04.png")
The 3-plot sets above aid in identifying the association of each feature relative to the cluster regions.
To further understand the cluster relationships, an additional view is presented below. For each cluster the mean value of each feature in that cluster was determined, the standard deviation of those means, and a z-score of each mean relative to the other means in that cluster were compared. The goal was not to assess these z-scores for statistically significant differences in means, but rather as a method to identify in a consistent way the relative participation of each feature in each cluster. The goal is to identify the few most (both positively and negatively) impactful features in defining the cluster characteristics. The median of each cluster, or some other statistic, could have also been used for this purpose. For the clusters developed for this data set, means and medians provide essentially the same view of the major contributors to a cluster characterization.
The plots below show these distributions of means for each feature in each of the clusters developed from the above k-means application.
As an example, we can make the following observations of the clusters based on these plots (and a detailed examination of the underlying values in a data table)
Similarly, an interpretation was completed for each cluster based on the relative measure of the feature means distribution within each cluster.
As will be shown in subsequent sections, this exercise was repeated for each of the clustering methods deployed. A synopsis of relevant clusters from the overall analysis will be presented in the summary section.
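One plausible implementation of the z-score computation described above is sketched below. The DataFrame, its column names, and the cluster assignments are illustrative placeholders; the standardization follows the text's description of scoring each cluster's feature means relative to the other means in that cluster.

```python
# Sketch: per-cluster feature means and z-scores of each mean relative to
# the other means in that cluster. Column names are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(200, 3)),
                  columns=["ln_LDA_00", "n_imgs", "global_subjectivity"])
df["cluster"] = rng.integers(0, 7, size=200)

# mean of each feature within each cluster (rows = clusters)
cluster_means = df.groupby("cluster").mean()

# z-score each cluster's feature means against the other means in that cluster
z = (cluster_means.sub(cluster_means.mean(axis=1), axis=0)
                  .div(cluster_means.std(axis=1), axis=0))

# most negatively / positively impactful features for cluster 0
print(z.loc[0].sort_values())
```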
Image("./plots/kmeans/cluster_kmeans_cluster_barplots.png")
To evaluate the resulting 7 clusters from the spectral clustering, a similar approach is taken as was used above for the k-means evaluation :
construct visual interpretation aid of a 3-plot set for each feature as shown in below figures. Each 3-set of plots includes the following :
These plots show the following relationships :
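The spectral clustering step itself can be sketched as below. This is a hedged illustration: the affinity, neighbor count, and random 2-D input are assumptions standing in for the notebook's actual parameters and t-SNE output.

```python
# Sketch: spectral clustering of the 2-D t-SNE coordinates into 7 clusters.
# affinity and n_neighbors are illustrative assumptions.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(3)
X_tsne = rng.normal(size=(300, 2))          # stand-in for the t-SNE mapping

spectral = SpectralClustering(n_clusters=7,
                              affinity="nearest_neighbors",
                              n_neighbors=10,
                              assign_labels="kmeans",
                              random_state=3)
labels = spectral.fit_predict(X_tsne)
print(np.bincount(labels))                  # cluster sizes
```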
Image("./plots/spectral/cluster_spctrl_3way_preplx_100_ln_LDA_00.png")
Image("./plots/spectral/cluster_spctrl_3way_preplx_100_ln_LDA_01.png")
Image("./plots/spectral/cluster_spctrl_3way_preplx_100_ln_LDA_02.png")
Image("./plots/spectral/cluster_spctrl_3way_preplx_100_ln_LDA_03.png")
Image("./plots/spectral/cluster_spctrl_3way_preplx_100_ln_LDA_04.png")
The 3-plot sets above aid in identifying the association of each feature relative to the cluster regions.
To further understand the cluster relationships, an additional view is presented. For each cluster the mean value of each feature in that cluster was determined, the standard deviation of those means, and a z-score of each mean relative to the other means in that cluster were compared. The goal was not to assess these z-scores for statistically significant differences in means, but rather as a method to identify in a consistent way the relative participation of each feature in each cluster. The goal is to identify the few most (both positively and negatively) impactful features in defining the cluster characteristics. The median of each cluster, or some other statistic, could have also been used for this purpose. For the clusters developed for this data set, means and medians provide essentially the same view of the major contributors to a cluster characterization.
The plots below show these distributions of means for each feature in each of the clusters developed from the above spectral clustering application.
As an example, we can make the following observations of the clusters based on these plots (and a detailed examination of the underlying values in a data table)
Cluster 01
Cluster 04
Similarly, an interpretation was completed for each cluster based on the relative measure of the feature means distribution within each cluster.
As will be shown in subsequent sections, this exercise was repeated for each of the clustering methods deployed. A synopsis of relevant clusters from the overall analysis will be presented in the summary section.
Image("./plots/spectral/cluster_spctrl_cluster_barplots_horizontal.png")
The 3-plot set can be viewed together for each feature to understand how that feature's relative values are distributed across the clusters.
Image("plots/hierarchical/HC_ward_feature_importance.png")
Image("plots/hierarchical/HC_complete_feature_importance.png")
Image("plots/hierarchical/HC_average_feature_importance.png")
Image("./plots/hierarchical/cluster_HC_ward_3way_preplx_100__4clstrsln_LDA_00.png")
Image("./plots/hierarchical/cluster_HC_complete_3way_preplx_100__4clstrsln_LDA_00.png")
Image("./plots/hierarchical/cluster_HC_average_3way_preplx_100__4clstrsln_LDA_00.png")
Image("./plots/hierarchical/cluster_HC_ward_3way_preplx_100__4clstrsln_LDA_01.png")
Image("./plots/hierarchical/cluster_HC_complete_3way_preplx_100__4clstrsln_LDA_01.png")
Image("./plots/hierarchical/cluster_HC_average_3way_preplx_100__4clstrsln_LDA_01.png")
Image("./plots/hierarchical/cluster_HC_ward_3way_preplx_100__4clstrsln_LDA_02.png")
Image("./plots/hierarchical/cluster_HC_complete_3way_preplx_100__4clstrsln_LDA_02.png")
Image("./plots/hierarchical/cluster_HC_average_3way_preplx_100__4clstrsln_LDA_02.png")
Image("./plots/hierarchical/cluster_HC_ward_3way_preplx_100__4clstrsln_LDA_03.png")
Image("./plots/hierarchical/cluster_HC_complete_3way_preplx_100__4clstrsln_LDA_03.png")
Image("./plots/hierarchical/cluster_HC_average_3way_preplx_100__4clstrsln_LDA_03.png")
Image("./plots/hierarchical/cluster_HC_ward_3way_preplx_100__4clstrsln_LDA_04.png")
Image("./plots/hierarchical/cluster_HC_complete_3way_preplx_100__4clstrsln_LDA_04.png")
Image("./plots/hierarchical/cluster_HC_average_3way_preplx_100__4clstrsln_LDA_04.png")
We will use these visualization plots in the following section to summarize and conclude.
7.0 Modeling and Evaluation 4
Under this section we will use the visualization plots to understand how features are clustered and identify any patterns that can be seen.
We will identify which features define each of the clusters in each of the clustering algorithms, and compare the composition of clusters across algorithms.
After reviewing the relative positive and negative participation of the features in the clusters, we identified some consistent themes that support a synopsis of the salient characteristics, and the contrasting characteristics, of the clusters in a way that supports the business interests.
The below table indicates for each cluster the relative strength (positive and negative) of these characteristics in defining a cluster composition.
Image("./plots/kmeans_cluster_summary.png")
A similar summary was prepared for the spectral clustering, characterizing its clusters as was done above for K-Means.
The below table indicates for each cluster the relative strength (positive and negative) of these characteristics in defining a cluster composition.
Image("./plots/spectral_cluster_summary.png")
For our analysis, we will look into each of the hierarchical models and compare the behaviour of features under each type. We will start by looking at the comparison matrix.
We arrived at the below matrix using the visualization plots.
e.g., ln_LDA_00 for linkage type 'Ward' has strong representation in cluster 0 and almost none in cluster 1.
Image("./plots/hierarchical/HC_comparisonMatrix.png")
LDA00 thru LDA04:
We can see that overall LDA00 thru LDA04 are highly impactful in forming clusters for all 3 models. What is interesting here is LDA_04: the Ward and average methods used this feature's values to distinguish clusters, but complete spread it evenly between all 3 clusters. Looking at the LDAs, we can clearly see that 'Ward' forms high- and low-value clusters.
data_channel (Entertainment, social media, lifestyle):
Among all data_channels, the Entertainment, Lifestyle and Social Media features are in our final list of features. Entertainment has made an impact on all 3 models in forming clusters; most Entertainment articles have formed a strong cluster of their own. Cluster 1 of 'ward', cluster 1 of 'complete' and cluster 4 of 'average' can be called high-entertainment clusters. Among the 3, Entertainment has by far the most data, with about 7K Mashable links, while Social Media and Lifestyle have only around 2K each.
Numbers like images, videos:
These features are particularly interesting as they behave differently for each of the models. While low video counts form a cluster in complete and average, the ward method shows no deviation. The image counts, however, do form clusters: ward and complete each formed a cluster with a high number of images. A high number of images is also strongly associated with Entertainment.
Global appeal and positivity (global_subjectivity, global_rate_positive_words, rate_positive_words, max_positive_polarity):
This set of features is interesting to note as it is strongly impactful in 'average' but has no impact in the other two methods. The 'average' method formed cluster 2, which could be named the low-global-appeal, low-positive-words cluster. This is the same cluster that was low on LDA_00 and LDA_04.
Each method by itself:
a. ward: just 2 clusters, with the LDAs, number of images, and data_channel Entertainment being the major factors. Cluster 1 is high images and Entertainment; there is strong visual importance here. Cluster 0 is low image count and ????
b. complete: 3 clusters. Cluster 1 is strong on LDA_01, LDA_03, Entertainment, and high image count; this is somewhat similar to cluster 1 of ward for high values. Complete forms a different cluster, cluster 2, for low values of these features, while cluster 0 is mostly LDA_02 and low values of LDA_01 and LDA_03.
c. average: formed 4 clusters. Cluster 3 is high on LDA_01, LDA_03, images, and Entertainment; this is consistent with the other methods as well. Cluster 0 is strong on LDA_00 and LDA_04 and away from Entertainment and videos. Cluster 2 is a very small cluster, mostly LDA_03 values with low Entertainment values.
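One way to compare cluster membership between linkage types, as done informally above, is to cross-tabulate the label assignments. The sketch below uses placeholder data and illustrative cluster counts (2 for ward, 4 for average, per the observations above), not the notebook's actual arrays.

```python
# Sketch: cross-tabulate cluster assignments from two linkage types to see
# how the methods' clusters overlap. X_tsne is an illustrative stand-in.
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(4)
X_tsne = rng.normal(size=(300, 2))

labels_ward = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X_tsne)
labels_avg = AgglomerativeClustering(n_clusters=4, linkage="average").fit_predict(X_tsne)

# rows: ward clusters, columns: average clusters; cells: shared article counts
ct = pd.crosstab(labels_ward, labels_avg, rownames=["ward"], colnames=["average"])
print(ct)
```

A large cell in this table indicates two clusters, one from each method, that contain largely the same articles, which is how the "consistent with other methods" statements above can be verified.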
Patterns:
LDA_01, LDA_03, Entertainment and images are consistently together. Clearly a certain set of vocabulary related to entertainment is found in LDA_01 and LDA_03. Putting this together with number of shares, the high-share Entertainment articles like "McDonalds Kills Site That Advised Employees to Eat Healthy Meals" and "What to Do With Your New Xbox One" have a high count of images (more than 10), while low-share Entertainment articles like "11 TV Characters Older Than the Class of 2018" and "Hasbro Games Show Signs of Life in iPad World" show 1 to no images.
Note: the patterns observed here are specific to the data analyzed. We cannot extrapolate this behavior to future Mashable data.
The below table presents the compilation of cluster characterization for all models.
For each cluster's primary defining characteristic, i.e., Data Channel Name, the relative scores of the strength of the other contributing characteristics are shown. The comparative analysis is accomplished by reading the table vertically. All of the World grouped clusters are shown in the first two columns, all of the Technology related clusters are shown in the next set of columns, etc.
The below table indicates for each cluster the relative strength (positive and negative) of these characteristics in defining a cluster composition. The table also aligns the similar composition clusters between the clusters developed by the K-Means, SpectralClustering, and hierarchical methods.
The comparison between the clustering methods are grouped vertically in the table so that a visual comparison can be made between similarly related channels E.g., the K-Means clusters associated to the World Data Channel are vertically aligned with the SpectralClustering cluster that is also associated to the World Data Channel.
Some observations :
Image("./plots/all_cluster_comparison.png")
After review of the relative positive and negative participation of the features in the clusters, we identified that there are consistent themes that support the below conclusions :
The purpose of this model is to provide a baseline characterization to product owners (data channel editorial content owners) that describes defining features of their respective data channel, and that those features are 'actionable' from the perspective of the editors. The editors plan to make controlled experimental modifications to the content in the coming months, and the purpose of this model is to define the product lines (data channels) by their clusters and the contributing constituents to each of those clusters. The information proposed in this report will form the basis for the content modification experiments.
To an extent, the primary objective of this project is satisfied. A baseline characterization of relevant clusters and features has been defined by the work above.
However, as this is the first model of this type on this data set, we recognize that there are opportunities for exploration that may produce measurable improvements if pursued.
The method that we deployed initiated with a dimensionality reduction using t-SNE.
The k-means analysis that we used benefited from the t-SNE mapping, as the resulting 2-D t-SNE vectors produced visually globular 2-dimensional distributions.
The spectral clustering analysis that we performed provided similar cluster definitions as did the k-means analysis. This may be due to the fact that the t-SNE mapped vectors developed near-neighbor patterns that are generally amenable to a k-means cluster approach, and therefore some of the additional capability that can be achieved with spectral clustering was not fully realizable. Future opportunities should look to alternative dimension reduction techniques and utilize more of the complex-geometry clustering available within the spectral clustering method. More nuanced views of clusters might be developed in that way.
The hierarchical clustering as deployed on this data set resulted in only 2 - 4 clusters, and in the case of the 4-cluster solution, the 4th cluster was a very small portion of the data set. So, in effect, hierarchical clustering identified 2 or 3 clusters within the whole domain. Although interesting, for the intended purposes we find that of somewhat limited usefulness. The close coupling of the variables in the rather dense space of the t-SNE map may have allowed the hierarchical clustering, using Euclidean distance, to extend widely across the space and reduce the number of clusters. If this method is used in future evaluations of this data set, some method for additional delineation of the space should be explored.
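The overall pipeline discussed above can be sketched end to end. This is a minimal illustration under assumptions: the random matrix stands in for the selected Mashable features, and the perplexity and cluster count are placeholders. As noted earlier in this report, t-SNE output changes between runs unless a seed is fixed.

```python
# Sketch of the deployed pipeline: standardize features, reduce to 2-D with
# t-SNE, then cluster the 2-D mapping. Data and parameters are illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 15))              # stand-in for the selected features

X_std = StandardScaler().fit_transform(X)
X_2d = TSNE(n_components=2, perplexity=30, random_state=5).fit_transform(X_std)
labels = KMeans(n_clusters=7, n_init=10, random_state=5).fit_predict(X_2d)
print(X_2d.shape, np.bincount(labels))
```

Fixing `random_state` on the t-SNE step is one way to address the run-to-run grouping changes mentioned in the project notes, though the mapping still depends on the input data.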
The current vision for deployment of the model is relatively straightforward :
For our exceptional work we have implemented the dimensionality reduction technique t-SNE.
Additionally, the visualizations of the association between cluster maps, feature magnitudes superposed on the maps, and the associated boxplot distributions provided an efficient and useful visualization to understand feature distribution within each cluster.